Accurate Unlexicalized Parsing for Modern Hebrew
Authors
Reut Tsarfaty and Khalil Sima’an
Abstract
Many state-of-the-art statistical parsers for English can be viewed as Probabilistic Context-Free Grammars (PCFGs) acquired from treebanks consisting of phrase-structure trees enriched with a variety of contextual, derivational (e.g., markovization) and lexical information. In this paper we empirically investigate the applicability and adequacy of the unlexicalized variety of such parsing models to Modern Hebrew, a Semitic language that differs in structure and characteristics from English. We show that, contrary to experience with parsing the WSJ, the markovized, head-driven unlexicalized variety does not necessarily outperform plain PCFGs for Semitic languages. We demonstrate that enriching unlexicalized PCFGs with morphologically marked agreement features percolated up the parse tree (e.g., definiteness) outperforms plain PCFGs as well as a simple head-driven variation on the MH treebank. We further show that an (unlexicalized) head-driven variety enriched with the same features achieves even better performance. We conclude that morphologically rich languages introduce an additional dimension of parametrization that is orthogonal to the horizontal/vertical dimensions proposed before [11], and that its contribution is essential and complementary.

Parsing Modern Hebrew (MH) as a field of study is in its infancy. Although a syntactically annotated corpus has been available for quite some time [15], we know of only two studies attempting to parse MH using supervised methods.¹ The reason state-of-the-art parsing models are not immediately applicable to MH is not only that their adaptation to the MH data and annotation scheme is not trivial, but also that they are not guaranteed to yield comparable results. The MH treebank is small, the internal phrase- and clause-structures are relatively flat and variable, multiple annotated dependencies complicate the selection of a single syntactic head, and a wealth of disambiguating morphological features is not exploited by current state-of-the-art models for parsing, e.g., English. This paper provides a theoretical overview of the MH data and an empirical evaluation of different dimensions of parameters for learning treebank grammars which break independence assumptions irrelevant for Semitic languages. We illustrate the utility of a three-dimensional parametrization space for parsing MH and obtain accuracy results that are comparable to those obtained for Modern Standard Arabic (75%) using a lexicalized parser [1] and a much larger treebank.

¹ The studies we know of are [15], which uses a DOP tree-gram model and 500 training sentences, and [16], which uses a treebank PCFG in an integrated system for morphological and syntactic disambiguation. Both achieved around 60-70% accuracy.

1 Dimensions of Unlexicalized Parsing

The factor that sets apart vanilla treebank Probabilistic Context-Free Grammars (PCFGs) [3] from unlexicalized extensions as proposed by, e.g., [10, 11], is the choice of statistical parametrization that embodies weaker independence assumptions. Recent studies on accurate unlexicalized parsing models outline two dimensions of parametrization. The first, proposed by [10], is the annotation of parent categories, effectively conditioning on aspects of a node's generation history; the second encodes a head-outward generation process [4] in which the head is generated first, followed by outward Markovian sister-generation processes.
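As a concrete illustration of these two dimensions, the short Python sketch below applies parent annotation (the vertical dimension) and a simple markovized right-binarization (an approximation of the horizontal, head-outward dimension) to a toy treebank and reads off a relative-frequency PCFG. The toy tree, the function names and the h=1 sister memory are illustrative assumptions, not the paper's implementation; in particular, a genuine head-outward factorization would start from an annotated head daughter rather than from the leftmost child.

```python
# Minimal sketch of the vertical (parent annotation) and horizontal
# (markovized binarization) parametrizations over a toy treebank.
# Trees are (label, children) pairs; a preterminal's child list holds the word.
from collections import Counter

TOY_TREEBANK = [
    ("S", [("NP", [("DT", ["the"]), ("NN", ["boy"])]),
           ("VP", [("VB", ["saw"]),
                   ("NP", [("DT", ["a"]), ("NN", ["girl"])]),
                   ("PP", [("IN", ["with"]),
                           ("NP", [("DT", ["a"]), ("NN", ["telescope"])])])])]),
]

def parent_annotate(tree, parent="TOP"):
    """Vertical dimension (v=1): NP under S becomes NP^S, VP under S becomes VP^S."""
    label, children = tree
    new_children = [c if isinstance(c, str) else parent_annotate(c, label)
                    for c in children]
    return (f"{label}^{parent}", new_children)

def _factor(label, kids, h):
    # Right-factor a flat daughter list, keeping the labels of at most h of the
    # remaining sisters in the intermediate symbol.
    if len(kids) <= 2:
        return kids
    memory = "-".join(k[0] for k in kids[1:1 + h])
    return [kids[0], (f"{label}|<{memory}>", _factor(label, kids[1:], h))]

def markovize(tree, h=1):
    """Horizontal dimension: binarize n-ary rules with limited sister memory."""
    label, children = tree
    kids = [c if isinstance(c, str) else markovize(c, h) for c in children]
    if isinstance(children[0], str):          # preterminal: nothing to factor
        return (label, kids)
    return (label, _factor(label, kids, h))

def rules(tree):
    """Yield the CFG productions (lhs, rhs) read off a single tree."""
    label, children = tree
    yield (label, tuple(c if isinstance(c, str) else c[0] for c in children))
    for c in children:
        if not isinstance(c, str):
            yield from rules(c)

def treebank_pcfg(trees):
    """Relative-frequency (maximum-likelihood) estimate of rule probabilities."""
    counts = Counter(r for t in trees for r in rules(t))
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

if __name__ == "__main__":
    transformed = [markovize(parent_annotate(t), h=1) for t in TOY_TREEBANK]
    for rule, p in sorted(treebank_pcfg(transformed).items()):
        print(rule, round(p, 2))
```

Running the sketch prints enriched rules such as VP^S -> VB^VP VP^S|<NP^VP>, the kind of symbols over which a (v=1, h=1) treebank grammar is estimated.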
Klein and Manning [11] systematize the distinction between these two forms of parametrization by drawing them on a horizontal-vertical grid: parent-ancestor encoding is vertical (v) (external to the rule), whereas head-outward generation is horizontal (h) (internal to the rule). By varying the values of the parameters along the grid they tune their treebank grammar to achieve better performance. This two-dimensional parametrization was shown to improve parsing accuracy for English [4, 1]² as well as for other languages, e.g., German [7], Czech [5] and Chinese [2]. However, results for languages other than English still lag behind.³

² Typically accompanied with various category-splits and lexicalization.
³ The learning curves over increasing training data (e.g., for German [7]) show that treebank size cannot be the sole factor accounting for the inferior performance.

We claim that for various languages, including the Semitic family, e.g., Modern Hebrew (MH) and Modern Standard Arabic (MSA), the horizontal and vertical dimensions of parameters are insufficient for encoding the linguistic information relevant for breaking false independence assumptions. In Semitic languages, arguments may move around rather freely and the phrase structure of clause-level categories is often shallow. For such languages, agreement features play a role in disambiguation that is at least as important as vertical and horizontal histories. Here we propose to add a third dimension of parametrization that encodes morphological features orthogonal to syntactic categories, such as those realizing syntactic agreement. These features are percolated from surface forms in a bottom-up fashion, and they express information that is orthogonal to the previous two dimensions. We refer to this dimension as depth (d), as it can be visualized as a dimension along which parallel tree structures labeled with syntactic categories encode an increasing number of morphological features at all levels of constituency. These structures lie in a three-dimensional coordinate system we refer to as (v, h, d).

This work focuses on MH and explores the empirical contribution of the three dimensions of parameters to analyzing different syntactic categories. We present extensive experiments that lead to improved performance as we increase the number of dimensions exploited across all levels of constituency. In the next section we review characterizing aspects of MH (and other Semitic languages), highlighting the special role of morphology and the kinds of dependencies witnessed by morphosyntactic processes. In section 3 we describe the method and procedure for the empirical evaluation of unlexicalized parsing models for MH. In section 4 we report and analyze our results, and in section 5 we conclude.

2 Dimensions of Modern Hebrew Grammar

2.1 Modern Hebrew Structure

Phrases and sentences in MH, as well as in Arabic and other Semitic languages, have a relatively flexible phrase structure. Subjects, verbs and objects can be inverted, and prepositional phrases, adjuncts and verbal modifiers can move around rather freely. The factors that affect word order in the language are not necessarily syntactic and have to do with rhetorical and pragmatic considerations as well. To illustrate, figure 1 shows two syntactic structures that express the same grammatical relations yet vary in their order of constituents.
The level of freedom in the order of internal constituents also varies between categories, and figure 1 further illustrates that within noun-phrase categories determiners always precede nouns.
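To connect this to the depth dimension proposed in section 1, the following sketch (an illustrative assumption, not the MH treebank's actual annotation scheme or tag set) percolates one morphologically marked agreement feature, definiteness, from segmented surface forms up a small MH-like noun phrase, roughly 'the big house' (ha-bayit ha-gadol), in which the definite article recurs on the adjective. Every constituent dominating a definite article is relabeled with +def, so the enriched categories can serve as nonterminals of an otherwise unlexicalized PCFG.

```python
# Schematic sketch of the depth (d) dimension: morphological features marked
# on surface forms are percolated bottom-up and appended to the syntactic
# category at every level of constituency.
# Trees are (label, children) pairs; a preterminal's single child is a dict
# holding the word form and its morphological features (toy transliteration).

np_tree = ("NP",
           [("H",  [{"word": "h", "def": True}]),        # definite article ha-
            ("NN", [{"word": "bit", "def": False}]),      # 'house'
            ("ADJP",
             [("H",  [{"word": "h", "def": True}]),       # agreement: article recurs
              ("JJ", [{"word": "gdwl", "def": False}])])])  # 'big'

def percolate(tree, feature="def"):
    """Mark a constituent X as X+<feature> if the feature is realized on any
    word it dominates, and recurse so every level of the tree is enriched."""
    label, children = tree
    if isinstance(children[0], dict):                     # preterminal
        marked = children[0].get(feature, False)
        return ((f"{label}+{feature}") if marked else label, children), marked
    new_children, marked = [], False
    for child in children:
        new_child, child_marked = percolate(child, feature)
        new_children.append(new_child)
        marked = marked or child_marked
    return ((f"{label}+{feature}") if marked else label, new_children), marked

if __name__ == "__main__":
    enriched, _ = percolate(np_tree)
    # The top rule is now NP+def -> H+def NN ADJP+def: the enriched labels act
    # as new PCFG nonterminals, making definiteness agreement inside the NP
    # visible to the grammar.
    print(enriched)
```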
Similar resources
Three-Dimensional Parametrization for Parsing Morphologically Rich Languages
Current parameters of accurate unlexicalized parsers based on Probabilistic Context-Free Grammars (PCFGs) form a two-dimensional grid in which rewrite events are conditioned on both horizontal (head-outward) and vertical (parental) histories. In Semitic languages, where arguments may move around rather freely and phrase-structures are often shallow, there are additional morphological factors that g...
Enhancing Unlexicalized Parsing Performance Using a Wide Coverage Lexicon, Fuzzy Tag-Set Mapping, and EM-HMM-Based Lexical Probabilities
We present a framework for interfacing a PCFG parser with lexical information from an external resource following a different tagging scheme than the treebank. This is achieved by defining a stochastic mapping layer between the two resources. Lexical probabilities for rare events are estimated in a semi-supervised manner from a lexicon and large unannotated corpora. We show that this solution g...
Increasing Accuracy While Maintaining Minimal Grammars in CKY Parsing
Significant work in both lexicalized and unlexicalized parsing has been done in the past ten years. F1 measures of accuracy of over 90% have been achieved (Bikel, 2005), and linguistic notions of lexical dependencies and using head words have been harnessed to create significant improvements in probabilistic CFG parsing. We note, however, that many of the techniques for improving lexicalized parsing create...
Fast and Accurate Unlexicalized Parsing via Structural Annotations
We suggest a new annotation scheme for unlexicalized PCFGs that is inspired by formal language theory and only depends on the structure of the parse trees. We evaluate this scheme on the TüBa-D/Z treebank w.r.t. several metrics and show that it improves both parsing accuracy and parsing speed considerably. We also show that our strategy can be fruitfully combined with known ones like parent ann...
Improved Inference for Unlexicalized Parsing
We present several improvements to unlexicalized parsing with hierarchically state-split PCFGs. First, we present a novel coarse-to-fine method in which a grammar’s own hierarchical projections are used for incremental pruning, including a method for efficiently computing projections of a grammar without a treebank. In our experiments, hierarchical pruning greatly accelerates parsing with no lo...
Publication date: 2007